Hw1

Question 1: Chromosome structures

Download the chromosome size files for the following genomes (note that these have been preprocessed to include only the main chromosomes):

  1. Arabidopsis thaliana (TAIR10) - An important plant model species [info]
  2. Corn (Zea mays B73v4) - The most widely grown crop in the world [info]
  3. E. coli (Escherichia coli K12) - One of the most commonly studied bacteria [info]
  4. Fruit Fly (Drosophila melanogaster, dm6) - One of the most important model species for genetics [info]
  5. Human (hg38) - us :) [info]
  6. Rice (Oryza sativa, IRGSP-1.0) - One of the most important crops in the world [info]
  7. Worm (Caenorhabditis elegans, ce10) - One of the most important animal model species [info]
  8. Yeast (Saccharomyces cerevisiae, sacCer3) - An important eukaryotic model species, also good for bread and beer [info]

Using these files, make a table with the following information per species:

  • Question 1.1. Total genome size
  • Question 1.2. Number of chromosomes
  • Question 1.3. Largest chromosome size and name
  • Question 1.4. Smallest chromosome size and name
  • Question 1.5. Mean chromosome length

Answers:

Genome   Total genome size   Number of chromosomes   Largest chromosome   Smallest chromosome   Mean chromosome length
TAIR10   119146348           5                       Chr1: 30427671       Chr4: 18585056        23829269.6
zm4      2106338117          10                      1: 307041717         10: 150982314         210633811.7
ecoli    4639211             1                       Ecoli: 4639211       Ecoli: 4639211        4639211.0
dm6      137547960           7                       chr3R: 32079311      chr4: 1348131         19649708.6
hg38     3088269832          24                      chr1: 248956422      chr21: 46709983       128677909.7
rice     373245519           12                      Chr1: 43270923       Chr9: 23012720        31103793.2
ce10     100286070           7                       chrV: 20924149       chrM: 13794           14326581.4
yeast    12157105            17                      chrIV: 1531933       chrM: 85779           715123.8

All sizes are in base pairs; the largest and smallest chromosomes are given as name: size.

Solutions:

The code used is shown below:

#!/usr/bin/env python3
"""Summarize a chromosome size file (name and size on each line) read from stdin."""
import sys

# Map chromosome name -> size. int() is essential: without it the sizes
# would be compared as strings and max/min would give wrong answers.
sizes = {}
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:   # skip blank or malformed lines
        continue
    sizes[fields[0]] = int(fields[1])

print('number of chromosomes:', len(sizes))

total = sum(sizes.values())
print('total length:', total)

# max/min with key=sizes.get return the name of the largest/smallest chromosome
largest = max(sizes, key=sizes.get)
print('largest chromosome size and name: %s %s' % (largest, sizes[largest]))

smallest = min(sizes, key=sizes.get)
print('smallest chromosome size and name: %s %s' % (smallest, sizes[smallest]))

print('mean chromosome length:', format(total / len(sizes), '.1f'))

Question 2: Sequence content

Download the yeast genome from here: http://schatz-lab.org/appliedgenomics2019/assignments/assignment1/yeast.fa.gz

  • Question 2.1. How many As, Cs, Gs, Ts are found in the entire genome
  • Question 2.2. Make a scatterplot of the %GC of 100bp windows across the genome: x-axis = genome location, y-axis = (#G + #C) / 100. For this analysis the chromosomes can be concatenated together to form a long string of the chromosomes in numerical order: chr1, chr2, … chrN. Make sure to draw a bar to indicate the ends of chromosomes.
  • Question 2.3. Make a histogram of the number of genomic bins of a given %GC: x-axis = %GC, y-axis = # genomic bins with this %GC
  • Question 2.4. Recall that Illumina sequencing performs poorly when the %GC is <= 30% or >= 65%. Based on the analysis from Q2.2, what fraction of the genome do you expect to sequence poorly?
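
One possible way to set up the computations behind Questions 2.1 to 2.4 is sketched below. It is a minimal illustration, not the reference solution. The helper names (read_fasta, gc_windows, analyze) are hypothetical, and the sketch assumes that the FASTA file already lists the chromosomes in the desired order, that a trailing window shorter than 100bp can be dropped, and that windows are cut from the concatenated genome, so a window may span a chromosome boundary.

```python
import gzip
import sys
from collections import Counter

WINDOW = 100  # window size in bp, from Question 2.2


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def gc_windows(seq, window=WINDOW):
    """GC fraction of each full, non-overlapping window (trailing partial window dropped)."""
    return [
        (seq.count("G", i, i + window) + seq.count("C", i, i + window)) / window
        for i in range(0, len(seq) - window + 1, window)
    ]


def analyze(records, window=WINDOW, low=0.30, high=0.65):
    """Return base counts, per-window GC, chromosome end positions, poor-GC fraction."""
    base_counts = Counter()
    pieces, chrom_ends, total = [], [], 0
    for _name, seq in records:          # assumes records arrive in chromosome order
        base_counts.update(seq)         # Q2.1: counts every character, including N
        pieces.append(seq)
        total += len(seq)
        chrom_ends.append(total)        # Q2.2: where to draw the chromosome bars
    gc = gc_windows("".join(pieces), window)
    poor = sum(1 for x in gc if x <= low or x >= high)   # Q2.4 thresholds
    fraction_poor = poor / len(gc) if gc else 0.0
    return base_counts, gc, chrom_ends, fraction_poor


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (file name taken from the command line): python3 gc.py yeast.fa.gz
    with gzip.open(sys.argv[1], "rt") as handle:
        counts, gc, ends, frac = analyze(read_fasta(handle))
    for base in "ACGT":
        print(base, counts[base])
    print("windows:", len(gc))
    print("fraction of windows with %GC <= 30% or >= 65%:", format(frac, ".4f"))
```

The returned gc list and chrom_ends positions are what the plots need: if matplotlib is available, the scatterplot for Question 2.2 could plot each window's start position against its GC value and mark each entry of chrom_ends with a vertical line, and the histogram for Question 2.3 could be drawn from the same gc list.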

Answers:

Question 2.1.
Question 2.2.